ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFactory (#13760) by igor-suhorukov · Pull Request #13811 · apache/arrow

igor-suhorukov · 2022-08-07T14:17:12Z

This PR allows developers to create a Dataset from Arrow IPC files in JVM code, for example:
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetURL);

It is a foundation for an Apache Spark Arrow data source that can process huge existing partitioned datasets in the Arrow file format without additional data format conversion.


github-actions · 2022-08-07T14:17:29Z

https://issues.apache.org/jira/browse/ARROW-17303

github-actions · 2022-08-07T14:17:30Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm · 2022-08-08T11:31:22Z

Thanks for the PR!

@davisusanibar @lwhite1 would one of you mind taking a look?

Is "osm_nodes.arrow" from OpenStreetMap? Are there licensing concerns around the data? Arrow already has test data files for use and/or files can be generated in-process.

igor-suhorukov · 2022-08-08T11:38:13Z

@lidavidm yes, it is 10 records from the OpenStreetMap planet dump. Could you please explain how to generate test data in the Arrow file format for testing the Dataset API, or point me to where the existing test data is located?

lidavidm · 2022-08-08T12:01:42Z

It'd be something like

// Assumes TMP is a JUnit TemporaryFolder rule and allocator is a BufferAllocator.
File out = TMP.newFile();
Schema schema = new Schema(Collections.singletonList(Field.nullable("ints", new ArrowType.Int(32, true))));
try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
     FileOutputStream fileOutputStream = new FileOutputStream(out);
     // ArrowFileWriter writes to a WritableByteChannel, so pass the stream's channel.
     ArrowFileWriter writer = new ArrowFileWriter(root, /*dictionaryProvider=*/null, fileOutputStream.getChannel())) {
    // Fill root with data
    IntVector ints = (IntVector) root.getVector(0);
    ints.setSafe(0, 0);
    root.setRowCount(1);
    // ...
    writer.start();
    writer.writeBatch();
    writer.end();
}
// Use out.getPath()...


igor-suhorukov · 2022-08-08T13:18:20Z

@lidavidm thank you for the advice. The OSM data has been deleted from the PR. Please check the updated test TestFileSystemDataset#testBaseArrowIpcRead.
Does it fit the project's testing approach?

lwhite1 · 2022-08-08T18:41:15Z

Hi @igor-suhorukov This looks good to me except I wish the tests were more robust. (The same is true for the Parquet test that you're emulating, but I guess that's out of scope here.)

This kind of test - relying on checking sizes and names - doesn't provide much assurance that we won't see bug reports when people import complex data types or otherwise tap into some of the more advanced functionality.

lidavidm · 2022-08-08T18:43:33Z

@lwhite1 we could file another JIRA for that?

lidavidm · 2022-08-08T18:46:07Z

Also a general note re: Larry's comment: we currently have a mix of JUnit 4/5, ad-hoc test helpers like the one here, and a mix of assertion libraries; it might be good to start incrementally cleaning that up (e.g. it would be much easier to test complex types if there were an easy setup to parameterize a test and have the data generated for you).

ARROW-6931 is sort of related, and ARROW-4740 (we added JUnit5 but didn't port the existing tests)

lwhite1 · 2022-08-08T18:48:26Z

I think it's fine to open a Jira for better Parquet testing. It would be preferable, IMO, to get better testing for the new functionality here, rather than file a ticket for it.


lidavidm · 2022-08-08T18:52:39Z

I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface)

lwhite1 · 2022-08-08T18:59:47Z

> I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface)

Ok. Works for me.

lwhite1

LGTM

lidavidm · 2022-08-08T19:02:31Z

I filed https://issues.apache.org/jira/browse/ARROW-17342

lidavidm · 2022-08-08T19:44:59Z

FWIW, looking at the JIRA/GH issue, this will only handle "IPC" files, not Arrow stream files - there's work needed on the C++ side if that is something we want to cover
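To illustrate the file/stream distinction mentioned above: the Arrow IPC file format wraps the streaming format with an "ARROW1" magic string at both the start and the end of the file, while a bare stream has neither. The following is a minimal heuristic sketch in plain Java, not part of the Arrow API; the class name `ArrowIpcMagic` and the method `looksLikeIpcFile` are hypothetical names chosen for this example, and a positive result only means the magic bytes are present, not that the file is valid.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper (not an Arrow class): checks for the IPC file-format magic bytes.
public final class ArrowIpcMagic {
    // The IPC file format starts and ends with this ASCII magic string.
    private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    private ArrowIpcMagic() {}

    /** Returns true if the file both starts and ends with the "ARROW1" magic. */
    public static boolean looksLikeIpcFile(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            // Too short to hold the leading and trailing magic strings.
            if (size < 2L * MAGIC.length) {
                return false;
            }
            return Arrays.equals(readAt(channel, 0), MAGIC)
                && Arrays.equals(readAt(channel, size - MAGIC.length), MAGIC);
        }
    }

    // Reads MAGIC.length bytes starting at the given absolute position.
    private static byte[] readAt(FileChannel channel, long position) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(MAGIC.length);
        while (buffer.hasRemaining()) {
            int read = channel.read(buffer, position + buffer.position());
            if (read < 0) {
                throw new IOException("Unexpected end of file");
            }
        }
        return buffer.array();
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path path = Paths.get(arg);
            String verdict = looksLikeIpcFile(path)
                ? "has IPC file-format magic"
                : "no IPC file-format magic (possibly a stream)";
            System.out.println(path + ": " + verdict);
        }
    }
}
```

A check like this could be run on a dataset directory before pointing FileFormat.ARROW_IPC at it, to spot stream files that this PR does not cover.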

igor-suhorukov · 2022-08-08T19:59:51Z

Thanks a lot for the clarification, @lidavidm and @lwhite1, and for your time. Don't worry about the refactoring. I have experience with this kind of refactoring, tech-debt fixing, and cleanup in Spring/ElasticSearch projects. It can be contributed by the community as the Arrow project matures, as separate activities for new joiners. It is a good starting point for someone.

ursabot · 2022-08-09T05:52:22Z

Benchmark runs are scheduled for baseline = a2f3666 and contender = 78351ce. 78351ce is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.34% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 78351cec ec2-t3-xlarge-us-east-2
[Finished] 78351cec test-mac-arm
[Finished] 78351cec ursa-i9-9960x
[Finished] 78351cec ursa-thinkcentre-m75q
[Finished] a2f3666d ec2-t3-xlarge-us-east-2
[Finished] a2f3666d test-mac-arm
[Finished] a2f3666d ursa-i9-9960x
[Finished] a2f3666d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber:

## Java/Scala side:
- JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800))
- Initial implementation of immutable Table API ([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589))

## Python side:
- The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).

…tory (apache#13760) (apache#13811)

This PR allow developers to create Dataset from ARROW IPC files in JVM code like:
`FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetURL);`

It is foundation for Apache Spark arrow data source to process huge existing partitioned datasets in ARROW file format without additional data format conversion

Lead-authored-by: Igor Suhorukov <igor.suhorukov@gmail.com>
Co-authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>


igor-suhorukov added 3 commits August 7, 2022 17:10

ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFac…

a8ed61d

…tory

ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFac…

85f3ef9

…tory (#13760)

Merge remote-tracking branch 'origin/master'

9a15296

github-actions Bot added the Component: Java label Aug 7, 2022

igor-suhorukov added 2 commits August 8, 2022 15:28

Merge branch 'apache:master' into master

9997bbd

ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFa…

9f3acbc

…ctory (#13760)

lwhite1 approved these changes Aug 8, 2022


lidavidm approved these changes Aug 8, 2022


lidavidm linked an issue Aug 8, 2022 that may be closed by this pull request

Read "arrow" (IPC and streaming) files usning org.apache.arrow.dataset.jni.NativeDatasetFactory in Java API #13760

Closed

lidavidm merged commit 78351ce into apache:master Aug 8, 2022

igor-suhorukov mentioned this pull request Aug 8, 2022

Read Apache ARROW IPC dataset by using FileSystemDatasetFactory and FileFormat.ARROW_IPC oap-project/gazelle_plugin#1060

Open

Yicong-Huang mentioned this pull request Dec 8, 2022

Bump Apache Arrow to 10.0.0 apache/texera#1764

Merged

asfimport mentioned this pull request Nov 26, 2024

[Java] Improve testing of Dataset bindings apache/arrow-java#304

Open
